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Abstract 

In this paper we propose a new regression interpretation of the Cholesky factor of the covari- 
ance matrix, as opposed to the well known regression interpretation of the Cholesky factor of the 
inverse covariance, which leads to a new class of regularized covariance estimators suitable for 
high-dimensional problems. Regularizing the Cholesky factor of the covariance via this regres- 
sion interpretation always results in a positive definite estimator. In particular, one can obtain 
a positive definite banded estimator of the covariance matrix at the sa me computational cost as 
the popular banded estimator proposed bv lBickel and Levina ( 2008b ). which is not guaranteed 
to be positive definite. We also establish theoretical connections between banding Cholesky 
factors of the covariance matrix and its inverse and constrained maximum likelihood estimation 
under the banding constraint, and compare the numerical performance of several methods in 
simulations and on a sonar data example. 



1 Introduction 



Statistical inference for high-dimensional data has become increasingly necessary in recent years. 
Advances in computing have made high-dimensional data analysis possible in a number of important 
applications, including spectroscopy, fMRI, text retrieval, gene arrays, climate studies, and imaging. 
Many multivariate data analysis techniques applied to high-dimensional data require an estimate of 
the covariance matrix or its inverse; however, traditional estimation by the sample covaria nce matrix 



see 



Johnstone 



is kno wn to perform poorly when there are more variables than observations (p > n) 
(|200lh and references therein for a detailed discussion. A number of alternative estimators have 
been proposed for high-dimensional problems, many of which exploit various sparsity assumptions 
about the population covariance matrix or its inverse. 

The problems of estimating the covariance matrix and its inverse are usually considered sepa- 
rately in this context, since in high dimensions inversion is costly and not always accurate. When 
the goal is to estimate the inverse covariance matrix, also known as the concentration matrix, 
a popular method is to add the lasso (li ) penalty on the entries of the i nverse covariance ma- 



trix to the normal like lihood (jdAspremont et all 120081 : lYuan and Linl . 120071 : iRothma n et a 



Lam and Fan 



2008; 
(|2007l b 



Friedman et al. . 20081 ). which has been extended to more general penalties by 
Other estimators of the inverse covariance exploit the assumption that variables have a natural or- 
dering, and those far apart in the ordering have small partial correlations. These estimators usually 
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rely on the modified Cholesky decomposition of the inverse covariance matrix (see details in Section 
[2j since this de composition has a nice regr e ssion interpre t ation a nd regression regularizatio n can 
be applied; s ee Wu and Pourahmadi ( 20031 ). Huang et al. ( 20061 ). Bickel and Levina ( 2008bl ). and 



Levina et al.l (120081 ). 



If the covariance matrix (rather than its inverse) is of interest, a simple way to improve on 
the sample covaria n ce, bo t h theoretically a n d in practice, is to th reshold small elements to zero 



(|Bickel and Levinal . l2008al : IeI Karouil . I2OO8I : iRothman et~ail 120091 ). Jnder the assumption that 



variables are ordered and those far apart in the or dering are only weakly correlated, a better option 
is to band or taper the sample covariance matri x (jBickel and Levinal . 12004 ; iFurrer and Bengtssonl . 



2007 ; Bickel and Levina . 2008b : Cai et al. . 20081 ). These simple approaches are attractive for prob- 



lems in very high dimensions since they have a small computational cost; however, these estima- 
tors are not generally guaranteed to be positive definite, although some forms of tapering can 
guarantee positive semi-definite estimates. Alternatively, a positive definite constrained maximum 
likelihood estimator can be computed under the constraint enforcing any given pattern of zeros 
( Chaudhuri et all 2007 ). but this algorithm is only applicable when there are fewer variables than 
observations (p < n). 

In this paper we show that the modified Cholesky factor of the covariance matrix (rather than its 
inverse) also has a natural regression interpretation, and therefore all Cholesky-based regularization 
methods can be applied to the covariance matrix itself instead of its inverse to obtain a sparse 
estimator with guaranteed positive definiteness. As with all Cholesky-based regularization methods, 
this approach exploits the assumption of naturally ordered variables where variables far apart in 
the ordering tend to have small correlations. The simplest estimator in this new class is banding 
the covariance Cholesky factor. Unlike banding the covariance matrix itself, it is guaranteed to be 
positive definite, but still has the same low computational complexity. 

The rest of this paper is organized as follows: we discuss the modified Cholesky factorization 
of the covariance matrix and its regression interpretation in Section [2J Regularization techniques 
appropriate for the Cholesky factor of covariance are described in Section [3j In addition, we 
connect sparsity in the covariance matrix to sparsity in its Cholesky factor and use this to contrast 
the maximum likelihood properties of banding the Cholesky factor of covariance and banding the 
Cholesky factor of the inverse. In particular, we prove that Cholesky banding of the inverse is the 
constrained maximum likelihood estimator for normal data under the constraint that the inverse 
covariance matrix is banded. Numerical performance of regularized Cholesky-based estimators of 
the covariance is illustrated both on simulated data (Section H]) and on a spectroscopy data example 
(Section [5]). 



2 Modified Cholesky decomposition of the covariance matrix 

Throughout the paper we assume that the data X\, . . . ,X n are independent and identically dis- 
tributed p-variate random vectors with population covariance matrix £ and, without loss of gener- 
ality, mean 0. Let £ denote the sample covariance matrix (the maximum likelihood version), 



E = - 

n 



Xi-X)(Xi-X 



As a tool for regularizing the inverse covariance matrix, Pourahmadi ( 19991 ) suggested using the 
modified Cholesky factorization of This factorization arises from regressing each variable Xj 
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on Xj—i, . . . , X\ for 2 < j < p. Fitting regressions 

i-i 

X,- = ^ (-tjg)X q + £j = Xj + €j , 
q=l 

let €j denote the error term in regression j, j > 2, and let e\ = X\. Let D = var(e) be the diagonal 
matrix of error variances and T = [tj q ] the lower-triangular matrix containing regression coefficients 
(with the opposite sign), with ones on the diagonal. Then writing e = X — X = TX and using 
the independence of errors we have, 

D = var(e) = var(TX) = TET T 

and thus 

S" 1 = T T D~ 1 T. (1) 

This decomposition transforms inverse covariance matrix estimation into a regression problem, 
and hence regularization approaches for regression can be applied. In practice, the coefficients 
are computed by regressing each variable Xj on its predecessors X\ , . . . , Xj-\ (after centering 
all the variables). If these regressions are not regularized, the resulting estimate is simply E -1 . 
Banding the Cholesky factor of the inverse refers to regularizing by onl y including the immediate k 



predecessors in the regres sion, X max (i.j-fc), • • • ,Xj-i, for some fixed k (jWu and Pourahmadil . 120031 ; 
Bickel and Levinal . l2008bh . 



The modified Cholesky factorization of E can be obtained by simply inverting (JTJ) . Let L = T 1 
and rewrite X = Le. Then, 

S = var (Le) = LDL T . (2) 

Our main interest here is in the regression interpretation of this decomposition. By analogy to 
(PQ), we can interpret (J2]) as resulting from a new sequence of regressions, where each variable Xj is 
regressed on all the previous regression errors • • • , ei (rather than the variables themselves). 
For j > 2, we have the sequence of regressions, 

j-l 

Xj = l jg €q + €j = Xj + 6j . (3) 

9=1 

The decompositions above apply to the population matrices. Let X = [xi, ■ ■ ■ , x p ] be the nx p 
data matrix, where each column Xj S M. n is already centered by its sample mean. For the first 
variable, we set e\ = x\. For 2 < j < p, let lj = (lj±, . . . , ljj-i) T , Zj = [e\, . . . , e 3 -_i], and compute 
coefficients and the residual, respectively, as 

lj = argmin \\xj — Zjlj\\ 2 , 



The variances are estimated as 



Zjlj . (4) 
1„ .., 



n 



Let Z = [e±, ■ ■ ■ ,e p ] denote the nx p matrix of residuals from carrying out the regressions in ([3]) 
sequentially. Here we assume that p < n to ensure that all model matrices are of full column 
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rank; Section [3] discusses the rank deficient case when p > n. Performing the regressions in @ 
amounts to, for each j > 2, orthogonally projecting the response Xj onto the span of ex, • • • , e^-i to 
estimate Zj. After the last projection we have an orthogonal basis {ei, • • • , e p }, and the estimates 
L and D. This algorithm is nothing but a scaled version of Gram-Schmidt orthogonalization of 
the data matrix X for computing its QR decomposition, where the upper triangular matrix R is 
restricted to have positive diagonal entries. The orthonormal matrix Q is the matrix Z with its 

column vectors scaled to have unit length and R T = L(nD) 2 . If all regressions are fitted without 
any regularization, simply by least squares (as described above), the resulting estimate recovers the 
sample covariance matrix: 

t = -X T X = —R T R = LDL T . 

n n 

3 Regularized estimation of the Cholesky factor L 

It is clear that in order to improve on the sample covariance, the regressions in Q need to be 
regularized. In this section we describe several estimators that introduce sparsity in covariance 
Cholesky factor L. We also connect sparsity patterns in positive definite matrices with sparsity 
patterns in their Cholesky factors and use this to analyze the connection between banding Cholesky 
factors and constrained maximum likelihood estimation. 



3.1 Banding the Cholesky factor 

The simplest way to introduce sparsity in the Cholesky factor L is to estimate only the first k sub- 
diagonals of L a nd set the rest to zero. This a ppro ach for banding the Choles ky factor of the inverse 
was proposed by Wu and Pourahmadi ( 20031 ) and Bickel and Levina ( 2008bl ). In practice, it means 



that each variable xj is regressed on the k previous residuals [ej-k, ■ ■ ■ ,ej-i], for all j > 2. Note 
that the index j — k everywhere is understood to mean max(l, j — k). Let 1^ = (/jj-fc, • • • , ljj-\) T 
and Zj k ^ = [ej_fe, . . . , e^-i]. Then we compute, 

= argmin \ \xj - Z^l^\\ 2 , (5) 

(k) 

In each regression, the design matrix Z - has orthogonal columns, which allows ([5]) to be solved 
with at most k univariate regressions. Hence the computational cost of banding the Cholesky factor 
in this manner is 0(kpn), the same order as banding the sample covariance matrix without the 
Cholesky decomposition. To ensure that design matrices are of full rank, the banding parameter 
k must be less than min(n — l,p). Also note that while each design matrix Z^ has orthogonal 
columns, all of the residual vectors e±, . . . , e p are not necessarily mutually orthogonal; ej and ey 
are only guaranteed to be orthogonal if \j — j'\ < k. 



3.2 Connection to constrained maximum likelihood 

Given that a Cholesky-based banded estimator is always positive definite, it is natural to ask 
whether it coincides with the maximum likelihood estimator under the banded constraint. Here 
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we show that, somewhat surprisingly, the answer depends on whether the banding is applied to 
the Cholesky factor of the inverse or of the covariance matrix itself: the former estimator coincides 
with constrained maximum likelihood estimator, and the latter does not. In order to show this, 
we first establish some relationships between zero patterns in positive definite matrices and their 
Cholesky factors. 

Proposition 1. Given a positive definite matrix £ with modified Cholesky decomposition £ = 
LDL T , where L is lower triangular, for any row i and c{i) < i, an = • • • = cjj )C (j) = if and only 
if hi = ■ ■ ■ = h,c(i) = 0- 
Proof. Using the expression 

3 

m=l 

it is obvious that hi = • • • = ^ |C (i) = implies an = ■ ■ ■ = a ifi ^ = 0. 

Now assume an = • • • = a^ c (n = for s ome i. The s equential column- wise formula for 
computing the modified Cholesky factorization (jWatkind . Il99lh . is given by, for i > j, 



i-1 



dii — an ^ li m dmm > 
m=l 

1 / j ~ l \ 

lij — ~~j I &ij ^ ^ hmljmdmm J • (6) 
a 33 V m=l / 

This formula allows one to compute L one column at a time, starting from the first column. We 
proceed by induction: for the first column of L, In = an /an, hence In = 0. Assume that for some 
column u < c(i) we have In = • • • = hu = 0) then using 



, I 0~i,u+l / Hm< j u+l,m,Q'u+l,u+l I 



u u +l,u+l \ , I «u+l,u+l 

' \ m=l / 

which implies h, u +i = 0. □ 

Proposition Q] states that a Cholesky factor with banded rows of arbitrary band length (by 
band length fcj of row i we mean that fej is the smallest integer such that Uj = for all j < i — fej) 
corresponds to a covariance matrix with banded rows of the same band lengths. In particular, the 
Cholesky factor L is fc-banded if and only if the covariance matrix itself is /c-banded. An analogous 
result holds for the inverse covariance matrix f2, with rows replaced by columns. 

Proposition 2. For a positive definite matrix 17 with modified Cholesky decomposition T T D~ 1 T = 
n, where T is lower triangular, for any column j and r(j) > j, cu p j = • • • = uJ r (j)j = if and only 
tftp,j = • • • = tr(j),j = 0- 

The proof of Proposition [2] is similar to that of Proposition [TJ and is omitted. Proposition 
[2] states that the modified Cholesky factor of the inverse T with arbitrary column band lengths 
corresponds to an inverse covariance matrix £1 with the same column band lengths, and thus an 
inverse covariance matrix is /c-banded if and only if its Cholesky factor is A;-banded. 

With these propositions we can investigate maximum likelihood properties of Cholesky and 
inverse Cholesky banding. 
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Proposition 3. Banding the modified Cholesky factor T of the inverse covariance matrx Vt maxi- 
mizes the normal likelihood subject to the banded constraint, Uij = for \i — j\ > k. 

Proof. Let O/m be a symmetric positive definite matrix with k non-zero main sub-diagonals, i.e., 
= f° r I* — J I > k. The negative normal log-likelihood of x±, . . . ,x n ~ N(Q, SL7 k \), up to a 
constant, is given by, 

/(%)) = trace(Sfi (fc) ) - log |fi (fe) |, 
where / is a function of the non-zero unique parameters in Slt k \. The fc-banded constrained 
maximum likelihood estimator Cl^ satisfies X7f(Cl^) = 0. Let T^DT^T^ = Qqa. be the 
modified Cholesky decomposition of SI ha. By Proposition [21 = for \i — j\ > k. Let 

g(T(Q, -D(fc)) = /(r^D^T(fe)), where g is a function of non-zero unique parameters in (T( fc ), D( fc )). 

We continue by establishing that if V5(T( fc ), = then T^DT^T^ = fi(fc). Let h(T^, D^)) = 
T^D^T^y Denote the differential of /i in the direction u = (At, Ad) evaluated at (T^yD^), 
byV/i(T( fe) ,D (fc ))[4 Then 

V/i(T (fc) ,D (fc) )[«] = Tf k) D^A T + ^D ( -Jr (fc) - l^D^Ifo , (7) 

where At is written as a p x p matrix with non-zero entries in the same positions as the non-zero 
lower triangular entries in T/ k \, and Ad is written as a p x p diagonal matrix. Since the diagonal 
entries of Tr k \ are all equal to 1 and the diagonal entries of D/ k \ are positive, one can show by 
induction that Vh(T^,D^)[u] = implies u = 0. By the chain rule, we have that 

Vg(T {k) ,D ik) )[u] = Vf(Tf k) D$T {k) )[u] ■ Vh(T (k) , D (k) )[u) . 

Since / is convex with unique minimizer &(k) h follows that V f (T^D^T^)[u] = if and only if 
T (k) D (k) T (k) = tyfc) unlessn = 0. Hence we have that V#(T (fe) , D( k) )[u] = iff V/(T ( ^D^T (fc ))[ii] = 
0and^ ) ^}f w =ft w . 

Minimizing g(TrjA, -D(fe)), which can be expressed as, 

p / n I ( J ~ 1 

3=1 \ i=l \ v=j-k 

is equivalent to minimizing, 

n I ( jl 

9j(t(k)j,.j-ki ^(fc)j,.f-i) d( k )jj) = n log d( k )jj + ^ - — - I Xy - I > 

for each row 1 < j < p. For row j, the solution to ^gjitifyjj-ki ■ ■ ■ j d(k)jj) = gives 

exactly the ordinary least squares regression coefficients (with the opposite sign) from regressing 
Xj on Xj_ k , . . . ,Xj—i, and the sample variance of the n residuals from this fit. Thus the solution 
coincides with the output of the inverse Cholesky banding algorithm. □ 

Next, we show that banding the Cholesky factor of the covariance matrix itself does not give the 
constrained maximum likelihood estimator. This is due to the inverse being the natural canonical 
parameter of the multivariate normal distribution. 




6 



Proposition 4. Banding the modified Cholesky factor L of the covariance matrix S does not 
maximize the normal likelihood under the constraint that = for \i — j\ > k. 

Proof. We show that the first-order necessary condition for optimality is not met using p = 3 
variables. Let the function g be the negative normal log-likelihood parameterized by the inverse 
Cholesky factor T = L^ 1 and D, which is given up to a constant by, 

g (T,D)=£(T T D- 1 T) = Y,[nlogd JJ + ^^(x lJ + Y,tj^ l }\ ] (8) 

j=l \ i=l H \ v=l J J 

Consider a 3 x 3 covariance matrix £ with the banding constraint a 3 \ = CT13 = 0. This constraint 
is equivalent to Z31 = by Proposition [TJ The inverse Cholesky factor T in terms of the entries in 
the Cholesky factor L is given by, 





Minimizing the negative log-likelihood subject to Z31 = is equivalent to minimizing the uncon- 
strained function 



3 11 1 

b(hi,k2,D) = n,y]log djj + — \\xi\\ 2 + — \\x 2 - I21X1W 2 + — \\x 3 + I32I21X1 - I32X2 \\ 

"11 "22 "33 

Taking the partial derivative of b with respect to I21, 

d 1 1 

b(hi,h2,D) = — (2l 2 \xjxi - 2xfx 2 ) + — (2l 32 XiX 3 - 2l\ 2 x\x 2 + 2Z 2 iZ| 2 £C T :E iy 



and evaluating at the Cholesky banding solution gives, 

d ff f - 2l 32 xjx 3 
-b{l 2 i,k2,D) - 



C/21 d 33 

Since -^—b{l,2i,h2,D) 7^ with probability 1, the Cholesky banding solution does not satisfy the 
first-order necessary condition for being an optimum of an unconstrained differentiate function 6, 
and hence Cholesky banding does not maximize the constrained normal likelihood. □ 

The constrained ma ximum likelihood estimator can be computed by the algorithm proposed by 
Chaudhuri et alJ <|2007f >. but this algorithm only works for p < n. We are not aware of suitable con- 



strained maximum likelihood estimation algorithms for p > n, which makes banding the Cholesky 
factor a more attractive option for computing a positive definite estimator for large p. In Section 
HI we briefly compare the numerical performance of banding the Cholesky factor to the constrained 
maximum likelihood estimator when p < n, and find that the two estimators are in practice very 
close. 
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3.3 The penalized regression approach 



Instead of banding the Cholesky factor, more sophisticated regularization approaches can be applied 
to regressions involved in the computation. In general, for 2 < j < p we can estimate the Cholesky 
factor by, 

lj = argmin{||£Cj — Zjlj\\ 2 + P\(lj)}. (9) 
h 

Penalty func t ions P\ that encourage sparsity in the coefficient vector lj are of particular interest. 
Huang et al. I (l200fih applied the lasso penalty in the inverse covariance Cholesky estimation problem, 



and here we can analogously use 

j'-i 

P\{h) = *Y<\ l it\- 
t=i 

The lasso penalty function can result in zeros in arbitrary locations in the Cholesky factor, which 
may or may not lead to any zeros in the resulting covariance matrix. To impose additional structure, 
Levina et al. (l2008h proposed the nested lasso penalty, which in our context is given by, 



Px NL Hj) = A ( \l S j-i\ + jf^+ l ^ + --- + M) > (10) 

where 0/0 is defined as 0. This penalty imposes the restriction that Ijt = if ljt+i = 0- By 
Proposition [H this means that all the zeros estimated in the Cholesky factor L will be preserved in 
E. This is n ot the case in t he inv erse Cholesky decomposition for which this penalty was originally 



proposed bv lLevina et ajj (120081). a lthough some (not all) zeros are preserved in that case as well. 



In practice, Levina et al. ( 20081 ) recommend using a slightly modified version of (|10p where the 



first term is divided by the univariate regression coefficient from regressing Xj on ej-i alone, to 
address a potential difference of scales, which is the version we used in simulations. Note that both 
lasso and nested lasso have much higher computational cost than banding, and are not appropriate 
for very large p; however, the additional flexibility of the sparsity structure may work well in some 
cases. 



4 Numerical results 



In this section we present a simulation study which compares the performance of all the covariance 
estima tors discussed in Section [3l banding the sample cov ariance matrix dir e ctly ( Bickel and Levinal . 
2008bh . and, as a benchmark, the shrinkage estimator of iLedoit and Wola (|2003l ). The main differ- 
ence between banding the sample covariance directly and regularizing the Cholesky factor is the 
guaranteed positive definiteness of the latter. The Ledoit-Wolf estimator is a linear combination 
of the identity matrix and the sample covariance matrix, where linear coefficients are estimates of 
asymptotically optimal coefficients under Frobenius loss; it does not introduce any sparsity. 



4.1 Simulation Settings 

We consider two standard covariance structures for ordered variables, 
1. Ei: = (7/10)1^1 ; 
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2. S 2 : Uij = t{i = j) + (4/10)l(|* - j\ = 1) + (2/10)l(|* - j\ = 2) + (2/10)l(|i - j\ = 
3) + (l/10)l(|i-i| =4). 

The AR(1) model Si has a dense Cholesky factor while the MA(4) model £2 is a banded matrix 
with k = 4, and t h erefore its Cholesk y factor is also 4-ban ded. The model Si was considered by 
Bickel and Levinal (j2008bh . and S 2 bv lYuan and Linl (j2007h . 



We generate n = 100 training observations and another 100 independent validation observations 
from iVp(0,X). The dimensions considered were p = 30, 100, 200, 500, and 1000. Note that lasso 
and nested lasso were not run for p = 500 and 1000 due to their high computational cost. Tuning 
parameters were selected by minimizing the Frobenius norm (||M||^ = ^ jmjj) of the difference 
between the regularized estimate computed with the training observations and the sample covari- 
ance computed with the validation o bservations. Alternatively, one could select tuning parameters 
using the random-splitting scheme of I Bickel and Levinal ( 2008bl ) . which we use in the data example 



in Section [5j The whole process was repeated 50 times. 

To compare estimators, we used the operator norm loss, also known as the matrix 2-norm 
(||M|| 2 = A max (MM T )), of the difference between the covariance estimator and the truth, 

A(S,E) = E\\t- S||. 

We also compute the true positive rate (TPR) and true negative rate (TNR), defined as 

TPR(S ' s) " wwJy^-Jo} ' 

TNR(E, S) = #{(^):^ = *nd ^ = 0} 

■ °ij = 0} 

Note that the sample covariance has a true positive rate of 1, and a diagonal estimator has a true 
negative rate of 1. Additionally we measure eigenspace agreement between the estimate and the 
truth using the measure, for q = 1, . . . , p 

*(9)=EE(«w c ci)) 8 ' (13) 
i=i j=i 



introduced by IKrzanowski (jl979h . where eu\ denotes the estimated eigenvector corresponding to 



the i-th largest estimated eigenvalue, and eu\ the true eigenvector corresponding to the i-th largest 
true eigenvalue. Note that K{q) = q indicates perfect agreement of the eigenspaces spanned by the 
first q eigenvectors. 

4.2 Results 

The averages and standard errors over 50 replications of the operator norm loss for both models 
are given in Table [TJ One can see that banding the Cholesky factor provides the best performance 
in every case. It outperforms banding the sample covariance directly, particularly in high dimen- 
sions, and both banding methods outperform the Ledoit-Wolf estimator as well as both regularized 
regression methods. 



<|2007h 



The banded maximum likelihood estimator was also computed using the algorithm of lChaudhuri et al 



for p = 30 (the algorithm is only applicable when p < n). Its loss values are 1.27(0.04) for 
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Table 1: Operator Norm Loss, average(SE) over 50 replications 



P 


Sample 


Lcdoit-Wolf 


Sample Banding 


Cholesky Banding 


Lasso 


Nested Lasso 




Si 


30 


1.75(0.04) 


1.67(0.04) 


1.27(0.04) 


1.27(0.03) 


1.68(0.05) 


1.45(0.04) 


100 


4.14(0.07) 


3.06(0.03) 


1.58(0.03) 


1.56(0.03) 


3.50(0.03) 


1.78(0.03) 


200 


6.55(0.07) 


3.79(0.02) 


1.75(0.03) 


1.74(0.03) 


3.90(0.01) 


1.93(0.03) 


500 


12.57(0.08) 


4.42(0.01) 


1.95(0.03) 


1.91(0.02) 






1000 


20.65(0.09) 


4.64(0.00) 


2.08(0.03) 


2.00(0.02) 








s 2 


30 


1.44(0.03) 


1.13(0.02) 


0.77(0.02) 


0.75(0.02) 


1.23(0.02) 


0.88(0.02) 


100 


3.34(0.04) 


1.64(0.01) 


0.92(0.02) 


0.89(0.02) 


1.63(0.01) 


1.00(0.01) 


200 


5.36(0.04) 


1.78(0.00) 


0.99(0.02) 


0.93(0.02) 


1.71(0.00) 


1.08(0.01) 


500 


10.36(0.05) 


1.84(0.00) 


1.09(0.02) 


1.05(0.02) 






1000 


17.60(0.07) 


1.85(0.00) 


1.19(0.02) 


1.14(0.02) 







Si and 0.76(0.02) for £2, which are essentially the same as those for Cholesky banding for p = 30. 
As expected, the margin by which sparse regularized estimators outperform non-sparse estimators 
(the sample and Ledoit-Wolf) is larger for the sparse population covariance £2. 

For the sparse matrix £2, we also report true positive and true negative rates of estimating zeros 
in Table [2j These rates also depend on the tuning parameter (k for banding and A for the lasso and 
nested lasso). Both Cholesky banding and sample covariance banding have perfect true negative 
rates, meaning that all of the realizations had at most 4 non-zero sub-diagonals. We see a better 
true positive rate for banding the Cholesky factor than for banding the sample covariance matrix, 
which means that banding the sample tends to set more diagonals to zero than necessary. This is 
partly because the entries on the fourth sub-diagonal of £2 are quite small. The lasso method has 
a low true negative rate, which is expected since zeros in the Cholesky factor are not preserved, 
and the nested lasso does reasonably well on both but not as well as Cholesky banding. 

Table 2: True Positive/True Negative Ratea for £2 (%), average(SE) over 50 replications 



p 


Banding 


Cholesky Banding 


Lasso 


Nested Lasso 


30 

100 

200 

500 

1000 


88.18(1.69) / 100(0) 
88.68(1.75) / 100(0) 
88.59(1.77) / 100(0) 
88.04(1.78) / 100(0) 
87.02(1.78) / 100(0) 


91.00(1.78) / 100(0) 
94.09(1.50) / 100(0) 
95.04(1.42) / 100(0) 
96.01(1.31) / 100(0) 
96.51(1.24) / 100(0) 


99.71(0.08) / 3.86(0.40) 
90.69(0.27) / 36.45(0.40) 
90.76(0.18) / 34.63(0.28) 


93.89(0.75) / 88.9(0.88) 
94.44(0.36) / 97.07(0.16) 
94.01(0.32) / 98.72(0.05) 


In Figure CD we plot the 


averaged estimated 


eigenvalues in descending 


order for sample band- 



ing, Cholesky banding, the sample covariance, and the Ledoit-Wolf estimator, as well as the true 



eigenvalues, for both models and p = 1000. Since n = 100, the sample covariance matrix only has 
99 non-zero eigenvalues. Cholesky banding and sample banding perform similarly for both mod- 
els, with Cholesky banding having a slight edge for the small eigenvalues. The banding methods 
outperform both the sample covariance and the Ledoit-Wolf estimator by a considerable amount, 
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Figure 1: Scree plots (averaged over 50 replications) for p = 1000. 

especially for larger true eigenvalues. 

Since sample covariance banding does not necessarily produce a positive definite estimator, we 
also report the percentage of estimates that are positive definite in Table [3j We see that for the 
dense matrix Si, sample banding has out of 50 positive definite realizations for p > 200; for the 
sparse matrix S2, sample banding has 50 out of 50 positive definite realizations for p < 200, 49 for 
for p = 500 and 48 for p = 1000; it is clear that, for both models, the larger p, the harder it is to 
keep positive definiteness. 

Table 3: Percentage of banded sample covariance realizations that are positive definite (based on 
50 replications) 



Model 


p 


30 


100 


200 


500 


1000 


Si 


66 


8 











s 2 


100 


100 


100 


98 


96 



Finally, Figure [2] shows a plot of the averaged eigenspace agreement measure K(q) versus q, for 
p = 1000 variables, along with the line K (g) = q representing perfect eigenspace agreement. We see 
that both Cholesky banding and ordinary banding perform roughly the same under this measure; 
both outperform the sample covariance matrix and the Ledoit-Wolf estimator, which have the same 
eigenvectors, since the Ledoit-Wolf estimator is a linear combination of the sample covariance and 
the identity. 
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Figure 2: K(q) versus q (averaged over 50 replications) for p 
perfect agreement. 



1000. K(q) = q corresponds to 



5 Sonar data example 

In this section we illustrate the effects of Cholesky banding and sample covariance band ing on 
SONAR data from the UCI machine learning data repository ([Asuncion and Newmanl . l2007n . This 
dataset has 111 spectra from metal cylinders and 97 spectra from rocks, where each spectrum has 
60 frequency band energy measurements. These spectra were measured at multiple angles for the 
same objects, but following previous analyses of the dataset we assume independence of the spectra. 

The top panel of Figure [3] shows heatmaps of the absolute values of the sample correlation 
matrices for metal and rock (we standardize the variables first to facilitate comparison for metal 
and rock spectra, which are on different scales). Both matrices show a general pattern of correlations 
decaying as one moves away from the diagonal, which makes banding a reasonable option. 

The b anding parameter k for bo th banding methods was selected using the random-splitting 



scheme of iBickel and Levina (2008 



k = argmin ^ 



1 N 

-Y 



v=l 



where is the banded estimator with k bands computed on the training data, and is 
the sample covariance of the validation data. To obtain these training and validation sets, the 
data was split at random TV = 100 times, with 1/3 of the sample used for training. For metal, 
Cholesky banding and sample banding both chose k = 31 sub-diagonals; for rock, Cholesky banding 
chose k = 17 and sample banding chose k = 18. Since these values are so close, for easier visual 
comparison we show Cholesky banding and sample banding both computed with k = 17 for the 
rock spectra. The heatmaps of the absolute values of the banded estimators are shown in Figure [3j 
We see that Cholesky banding shrinks the non-zero correlations whereas the sample banding does 
not, which is the property that allows Cholesky banding to achieve positive definiteness. 
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Samp. Metal Samp. Rock 




Samp. Band, k = 31 Metal Samp. Band, k = 17 Rock 




Figure 3: Heatmaps of the absolute values of the correlation estimates. White is magnitude and 
black is magnitude 1. 

We also show eigenvalue plots for these estimators in Figure [JJ a) and (b), and the eigenspace 
agreement measure between the banded estimators and the sample covariance in Figure HJc) and 
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Figure 4: (a) Scree plots of the banded estimators, using the metal spectra; (b) Scree plots using 
the rock spectra; (c) Eigenspace agreement with the sample covariance matrix using the metal 
spectra, (d) Eigenspace agreement with the sample covariance matrix using the rock spectra. Note 
that in (c) and (d), K(q) = q, drawn as the gray dashed line, corresponds to perfect agreement, 
see (fl~3j) for the definition of K(q). 

(d) , using the agreement measure (fl~3j) . We see that the sample covariance has the most spread out 
eigenvalues, and the eigenvalues from Cholesky banding have the least spread, as we would expect. 
For eigenvectors, there are no major differences between the estimators, a result consistent with 
simulations. 

We also compared the performance of the various estimators if they are used in quadratic 
discriminant analysis (QDA) to discriminate between rock and metal. An observation x is classified 
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as rock (k = 0) or metal (k = 1) using the QDA rule, 

G(x) = argmax|ilog|fi fc | - ^(x - p, k ) T Cl k (x - p, k ) +logvr fc | , 

where nk is the proportion of class k observations in the training sample, fi k is the training class k 
sample mean vector, and Cl k is the inverse covariance estim ate computed with th e class k training 
observations. A full description of QDA can be found in iMardia et aD (jl979h . In addition to 



banding the Cholesky factor of covariance and of the inverse, we also added a diagonal estimator 
of the covariance matrix (which corresponds to the naive Bayes classifier). Leave-one-out cross 
validation was used to estimate the testing error, and the banding parameters were selected with 
10 random splits with 1/3 of the data used for training, using Frobenius loss for covariance Cholesky 
banding and the validation likelihood for the inverse covariance Cholesky banding. Banding the 
sample covariance was omitted because its lack of positive definiteness led to inversion problems. 
The test errors (%) were 24.0(3.0) for the sample covariance, 32.7(3.3) for naive Bayes, 20.2(2.8) for 
covariance Cholesky banding, and 14.9(2.5) for inverse Cholesky banding. Both banding methods 
are substantially better than either estimating the whole dependency structure by the sample 
covariance or not estimating it at all (naive Bayes), and the inverse Cholesky banding does better 
in this case because it introduces sparsity directly in the inverse covariance. 

6 Summary and discussion 

In this paper we proposed a new regression interpretation of the Cholesky factor of the covariance 
matrix, which was previously only available for the Cholesky factor of the inverse. Banding of this 
Cholesky factor gives a banded positive definite estimator of the covariance, unlike banding the 
sample covariance matrix, and was shown to perform better numerically. An attractive property 
of the banded Cholesky estimator is its low computational cost, the same as that of banding the 
sample covariance matrix itself, and thus there is no computational penalty to pay for enforcing 
positive definiteness. More complicated regularization obtained from penalties such as the lasso 
or the nested lasso can be applied using the same regression interpretation, but at an additional 
computational cost. The proposed estimators perform well numerically under a variety of measures. 

We also connected sparsity in banded Cholesky factors with sparsity in the covariance ma- 
trix and the inverse covariance matrix, which allows us to show that inverse Cholesky banding is 
equivalent to constrained maximum likelihood under the banded constraint. Banding the Cholesky 
factor of the covariance itself is not equivalent to constrained maximum likelihood, but we found 
empirically they perform similarly. In terms of convergence rates, one would expect a convergence 
result analogous to the one for inverse Cholesky banding established by lBickel and Levina (l2008bh 



to hold here as well, but this case presents substantial extra technical difficulties in analysis, due 
to the fact that the errors used as predictors in the regressions required to compute the Cholesky 
factor are unobservable and have to be estimated by residuals. Nonetheless, we expect the method 
to be equally useful based on its good practical performance. 
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